Abstract

Perfect sorting by reversals, a problem originating in computational genomics, is the process of
sorting a signed permutation to either the identity or to the reversed identity permutation, by a
sequence of reversals that do not break any common interval. Bérard et al. (2007) make use of strong
interval trees to describe an algorithm for sorting signed permutations by reversals. Combinatorial
properties of this family of trees are essential to the algorithm analysis. Here, we use the expected
value of certain tree parameters to prove that the average run-time of the algorithm is at worst
polynomial, and additionally, for sufficiently long permutations, the sorting algorithm runs in
polynomial time with probability one. Furthermore, our analysis of the subclass of commuting scenarios
yields precise results on the average length of a reversal, and the average number of reversals.
A preliminary version of this work appeared in the proceedings of Combinatorial Pattern Matching
(CPM) 2009, Lecture Notes in Computer Science, vol. 5577, pp. 314-325, Springer.

1 Introduction 

There are many examples where the average case complexity of a sorting algorithm is neatly computed
with a generating function computation on a related family of trees. Most of the heavy lifting is done
by complex analysis. We give a new example here: we perform an average case analysis of a sorting
algorithm from computational genomics by generating function analysis of a family of trees.

Motivation: a computational genomics problem. With the availability of a growing number of
sequenced and assembled genomes, the comparison of whole genomes in terms of large-scale evolutionary
events called genome rearrangements is a fundamental task in computational genomics. Computing a
genomic distance and/or a parsimonious evolutionary scenario between a pair of genomes is one of the
basic problems in this field, with applications such as reconstructing phylogenies [25] or unraveling
evolutionary properties of groups of genomes [24, 26]. This general problem was formally introduced as
an algorithmic problem by Sankoff in [27]. Since then, these questions have been extensively investigated,
for different models of genomes and genome rearrangements, leading to a rich corpus of combinatorial
and algorithmic results; we refer the reader to the recent book by Fertin et al. on this topic [15].

Signed permutations, reversals and scenarios. In this work, we study the problem of computing 
parsimonious perfect reversal scenarios between unichromosomal genomes. Unichromosomal genomes can 
be modeled by signed permutations: each element of a permutation corresponds to a genomic marker 
(a gene for example but not exclusively), defined as a segment of the double-stranded DNA molecule 
forming a chromosome segment, with its sign indicating which strand of the chromosome carries the 
marker. 

A reversal is an evolutionary event that reverses a chromosomal segment. It can be modeled as a 
discrete operator acting on a signed permutation, reversing the order and sign of an interval of the
permutation. A sequence of reversals that transforms one signed permutation into another one is viewed as a
possible evolutionary scenario from one genome to another. Such a scenario is said to be parsimonious
if no other scenario exists that requires fewer reversals.
Notice that up to relabeling, we can always assume that one of the two permutations is the identity.
Without loss of generality, we assume that the permutation we want to obtain at the end of a scenario
is the identity, hence the connection with the sorting problem. Sankoff initiated in [27] the algorithmic
study of parsimonious reversal scenarios. Since then this problem has been considered by many authors,
and efficient algorithms exist to compute a parsimonious scenario [19, 30].

Common intervals and perfect scenarios. However, there can be many scenarios that satisfy this 
parsimony constraint. In fact, on real data sets, there can be an exponential number of parsimonious 
reversal scenarios (see [10] for example). This illustrates the need to refine the criteria that define a
good evolutionary scenario, and to go beyond the simple parsimony criterion. This need motivates the
introduction of perfect scenarios.

Perfect scenarios aim at avoiding convergent evolution. That is, if groups of genes or other genomic 
markers are co-localized in genomes of two species, a preferred scenario would preserve this quality back 
to the ancestral genome; the group of genes should remain together in every step of the evolution. In 
the combinatorial model based on signed permutations, this appears as a common interval: a collection 
of sequential numbers that forms an interval both in the identity permutation and (in absolute values) 
in the signed permutation to be sorted. A sorting scenario for a signed permutation $\sigma$ is said to break a
common interval $I$ of $\sigma$ if it contains a reversal such that the elements of $I$ do not form an interval in
the permutation obtained after the reversal is performed. A scenario that does not break any common
interval is said to be perfect, and may very well be longer than the shortest, purely parsimonious,
scenario. However, it is considered to have stronger properties as a hypothetical evolutionary scenario.
The algorithmic problem is thus stated: 

Given a signed permutation, compute a sequence of reversals that sorts it towards the identity 
or reversed identity, does not break any of its common intervals, and is shortest among all 
such scenarios. 

Notice that the permutation obtained at the end of the scenario can be the identity, or the reversed 
identity, which represents the same genome but viewed from the other end. 

Computing perfect scenarios: existing results. The refined problem that asks to preserve only a
predefined subset of the existing common intervals is NP-complete [16]. Even in the general problem,
which considers all common intervals, no algorithms with polynomial worst-case time complexity are
known. However, some fixed parameter tractable (FPT) algorithms have been described.

There also exist some classes of signed permutations that define tractable instances [3HHEI]. Among
such tractable classes of signed permutations, commuting permutations form the sub-class of signed
permutations that can be sorted by a commuting scenario, i.e. by a perfect scenario with the striking trait that
the property of being a perfect scenario is preserved even when the sequence of reversals is reordered
in every possible way. Surprisingly, examples of commuting scenarios arise in the study of mammalian
genome evolution [3].

A link with trees and its applications: new results. The central combinatorial object in the
theory of perfect sorting by reversals is the "strong interval tree", which tracks all common intervals of a
(signed) permutation. It serves as a guide for the computation of perfect scenarios, and the parameters
introduced in the FPT algorithms described in [U|5] read naturally in terms of this tree. This link opens
the way to a refined analysis of some of the existing algorithms for perfect sorting by reversals, which is
the purpose of our work.

The two key new results in Section 3 are Theorem 10, which states that for large enough $n$, with
probability 1, computing a perfect scenario for signed permutations can be done in time polynomial in
$n$, and Theorem 15, which states that computing a perfect scenario can be done in polynomial time on
average. Section 4 offers two new results on the average shape of a commuting scenario: we show that
in parsimonious perfect scenarios for commuting permutations of size $n$, the average number of reversals
is asymptotically $1.2n$, and the average length of a reversal is asymptotically $1.05\sqrt{n}$.

We conclude by discussing the relevance of these results, from both theoretical and applied points of
view, and outlining future research.
2 Preliminaries



We first summarize the combinatorial and algorithmic frameworks for perfect sorting by reversals. For 
a more detailed treatment, in particular for properties of the strong interval tree, we refer the reader to 

a- 

Permutations, reversals, common intervals and perfect scenarios. A signed permutation of
size $n$ is a permutation of the set of integers $\{1, 2, \ldots, n\}$ in which each element additionally has a sign,
either positive or negative. For clarity, negative integers are represented by placing a bar over them and
positive signs are omitted. We write our permutations in one-line notation. For example, $\sigma = [1\ \bar{3}\ \bar{2}\ 5\ 4\ 6]$
is a signed permutation of size 6. We denote by $\mathrm{Id}_n$ (resp. $\overline{\mathrm{Id}}_n$, $\mathrm{Id}^m_n$) the identity (resp. reversed identity,
mirrored identity) permutation, $[1\ 2\ \ldots\ n]$ (resp. $[\bar{n}\ \ldots\ \bar{2}\ \bar{1}]$, $[n\ \ldots\ 2\ 1]$). When the number $n$ of elements
is clear from the context, we will simply write $\mathrm{Id}$, $\overline{\mathrm{Id}}$, or $\mathrm{Id}^m$.

An interval $I$ of a signed permutation $\sigma$ of size $n$ is a segment of adjacent elements of $\sigma$. The content
of $I$ is the subset of $\{1, \ldots, n\}$ defined by the absolute values of the elements of $I$. Given $\sigma$, an interval
is defined by its content and from now on, when the context is unambiguous, we identify an interval with
its content.

The reversal of an interval of a signed permutation reverses the order of the elements of the interval,
while changing their signs. The length of a reversal is the number of elements in the interval that is
reversed. If $\sigma$ is a permutation, we denote by $\overline{\sigma}$ the permutation obtained by reversing the complete
permutation $\sigma$. A scenario for $\sigma$ is a sequence of reversals that transforms $\sigma$ into $\mathrm{Id}_n$ or $\overline{\mathrm{Id}}_n$. The length
of such a scenario is the number of reversals it contains. A scenario of minimal length is a parsimonious
scenario.

Example 1. Let $\sigma = [1\ \bar{4}\ \bar{5}\ 2\ \bar{3}\ 6]$ be a signed permutation of size 6, then $\overline{\sigma} = [\bar{6}\ 3\ \bar{2}\ 5\ 4\ \bar{1}]$. Reversing,
in $\sigma$, the interval $[\bar{5}\ 2\ \bar{3}]$, or equivalently the set $\{2, 3, 5\}$, yields the signed permutation $[1\ \bar{4}\ 3\ \bar{2}\ 5\ 6]$.
Reversing successively $\{2, 3, 4\}$ and $\{3\}$ completes this first reversal to form a parsimonious scenario of
length 3.
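The reversal operator is easy to make concrete. The following is a minimal sketch, assuming a signed permutation is stored as a Python list of nonzero integers (a negative integer standing for a barred element); the function names `reverse_interval` and `reverse_content` are illustrative and not taken from the paper. The permutation used is one signed permutation of size 6 for which reversing {2,3,5}, then {2,3,4}, then {3} reaches the identity.

```python
from typing import List, Set


def reverse_interval(perm: List[int], i: int, j: int) -> List[int]:
    """Reverse the segment perm[i..j] (0-based, inclusive) and flip its signs."""
    segment = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + segment + perm[j + 1:]


def reverse_content(perm: List[int], content: Set[int]) -> List[int]:
    """Reverse the interval whose set of absolute values is `content`."""
    positions = [k for k, x in enumerate(perm) if abs(x) in content]
    i, j = positions[0], positions[-1]
    # The elements must be adjacent in perm, otherwise this is not an interval.
    if len(positions) != len(content) or j - i + 1 != len(content):
        raise ValueError("content is not an interval of the permutation")
    return reverse_interval(perm, i, j)


# Illustrative input: negative entries play the role of barred elements.
sigma = [1, -4, -5, 2, -3, 6]
step1 = reverse_content(sigma, {2, 3, 5})
step2 = reverse_content(step1, {2, 3, 4})
step3 = reverse_content(step2, {3})
print(step1, step2, step3)
```

Identifying a reversal by its content rather than by positions mirrors the convention of the text, where an interval is identified with its content.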

A common interval of a permutation $\sigma$ of size $n$ is a subset of $\{1, 2, \ldots, n\}$ that is an interval in
both $\sigma$ and the identity permutation $\mathrm{Id}_n$. The singletons and the set $\{1, 2, \ldots, n\}$ are always common
intervals, called trivial common intervals.

Example 2. The common intervals of $\sigma = [1\ \bar{3}\ \bar{2}\ 5\ 4\ 6]$ are {2,3}, {1,2,3}, {4,5}, {4,5,6}, {2,3,4,5},
{2,3,4,5,6}, {1,2,3,4,5}, {1,2,3,4,5,6}, and the singletons {1}, {2}, {3}, {4}, {5}, {6}.
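The list of Example 2 can be reproduced by brute force. The sketch below is a quadratic enumeration written for illustration only (it is not one of the linear-time algorithms cited later); the function name `common_intervals` is an assumption of this sketch. It relies on the observation that a segment of positions i..j is a common interval with the identity exactly when its absolute values are consecutive integers.

```python
def common_intervals(perm):
    """Return the common intervals of `perm` and the identity, as frozensets.

    Signs are ignored. A segment of positions i..j is a common interval
    exactly when max - min of its absolute values equals j - i.
    """
    values = [abs(x) for x in perm]
    found = set()
    for i in range(len(values)):
        lo = hi = values[i]
        for j in range(i, len(values)):
            lo, hi = min(lo, values[j]), max(hi, values[j])
            if hi - lo == j - i:
                found.add(frozenset(range(lo, hi + 1)))
    return found


# Signs do not matter here, so the unsigned shape of the example is enough.
intervals = common_intervals([1, 3, 2, 5, 4, 6])
non_trivial = sorted(sorted(s) for s in intervals if 1 < len(s) < 6)
print(non_trivial)
```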

Two distinct sets (intervals here) $I$ and $J$ commute if their contents trivially intersect, that is either
$I \subset J$, or $J \subset I$, or $I \cap J = \emptyset$. If intervals $I$ and $J$ do not commute, they overlap. A scenario $S$ for
$\sigma$ is a perfect scenario if no reversal of $S$ breaks any common interval of $\sigma$, or equivalently [4] if every
reversal of $S$ commutes with every common interval of $\sigma$. It is easy to see that there always exists a
perfect scenario for a given signed permutation. A perfect scenario of minimal length, among all perfect
scenarios, is a parsimonious perfect scenario.

A permutation $\sigma$ is said to be commuting if there exists a scenario for $\sigma$ such that for every pair
of reversals the corresponding intervals commute. Such a scenario is called a commuting scenario and
is obviously perfect. It was shown in [3] that, if a signed permutation can be sorted by a commuting
scenario, then any other perfect scenario for this signed permutation has the same set of reversals, and
conversely every reordering of the reversals also gives a perfect scenario. This implies that a commuting
scenario is also a parsimonious perfect scenario.

Example 3. Let $\sigma = [1\ \bar{3}\ \bar{2}\ 5\ 4\ 6]$ be a signed permutation of size 6. The scenario {2,3}, {4,5}, {4},
{5} is a commuting scenario, and $\sigma$ is a commuting permutation.
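For very small permutations, a parsimonious perfect scenario can be found by exhaustive search, which gives a way to experiment with the definitions. The sketch below is a breadth-first search over signed permutations that only applies reversals commuting with every common interval of the input; it is exponential in general and is not Algorithm BBCP07. The function names are assumptions of this sketch, and the input is a signed permutation of size 6 sorted by the four reversals {2,3}, {4,5}, {4}, {5}.

```python
from collections import deque


def common_intervals(perm):
    """Common intervals of perm and the identity (signs ignored), as frozensets."""
    values = [abs(x) for x in perm]
    found = set()
    for i in range(len(values)):
        lo = hi = values[i]
        for j in range(i, len(values)):
            lo, hi = min(lo, values[j]), max(hi, values[j])
            if hi - lo == j - i:
                found.add(frozenset(range(lo, hi + 1)))
    return found


def commute(a, b):
    """Two sets commute if one contains the other or they are disjoint."""
    return a <= b or b <= a or not (a & b)


def shortest_perfect_scenario(sigma):
    """Breadth-first search for a shortest perfect scenario (list of contents)."""
    n = len(sigma)
    intervals = common_intervals(sigma)
    identity = tuple(range(1, n + 1))
    reversed_identity = tuple(-x for x in reversed(identity))
    start = tuple(sigma)
    parent = {start: None}
    queue = deque([start])
    while queue:
        perm = queue.popleft()
        if perm in (identity, reversed_identity):
            scenario = []
            while parent[perm] is not None:
                perm, content = parent[perm]
                scenario.append(content)
            return scenario[::-1]
        for i in range(n):
            for j in range(i, n):
                content = frozenset(abs(x) for x in perm[i:j + 1])
                # Keep only reversals that commute with all common intervals.
                if not all(commute(content, c) for c in intervals):
                    continue
                nxt = perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]
                if nxt not in parent:
                    parent[nxt] = (perm, content)
                    queue.append(nxt)
    return None


scenario = shortest_perfect_scenario([1, -3, -2, 5, 4, 6])
print([sorted(r) for r in scenario])
```

The order in which the four reversals are reported depends on the search order, which is consistent with the fact that a commuting scenario stays perfect under any reordering.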

Remark 4. Commuting permutations have been investigated, in connection with permutation patterns,
under the name of separable permutations [21].

The strong interval tree. First, we remark that the following definitions are valid for both signed
and unsigned permutations. A common interval $I$ of a permutation $\sigma$ is a strong interval of $\sigma$ if it
commutes with every other common interval of $\sigma$. The inclusion order on the set of strong intervals
of a permutation of size $n$ defines an $n$-leaf tree, denoted by $T(\sigma)$, whose leaves are the singletons and
whose root is the interval containing all elements of the permutation. We require that the elements of
$\{1, 2, \ldots, n\}$ appear on the leaves of $T(\sigma)$ from left to right in the same order they do in $\sigma$. This implies
that the children of every internal vertex of $T(\sigma)$ are totally ordered, or in other words that $T(\sigma)$ is a
plane tree, i.e. a tree embedded in the plane. We identify a vertex of $T(\sigma)$ with the strong interval it
represents. If $\sigma$ is a signed permutation, the sign of every element of $\sigma$ is given to the corresponding
leaves in $T(\sigma)$. Figure 1 shows an example of a strong interval tree.

Figure 1: The strong interval tree T([1 8 4 2 5 3 9 6 7 12 10 14 13 11 15 17 16 18]). Vertices are labeled by
the strong intervals. There are three non-trivial linear vertices (rectangular) and three prime vertices (round).
The root and the vertex {6, 7} are increasing linear vertices, while the linear vertices {16, 17} and {13, 14} are
decreasing.
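The vertices of the strong interval tree can be listed directly from the definition. The sketch below is a brute-force illustration (not the linear-time construction discussed later): it keeps the common intervals that commute with all others, and recovers the children of a vertex as its maximal strong proper sub-intervals, ordered by position. Function names are assumptions of this sketch; the input is the unsigned permutation of the Figure 1 caption.

```python
def common_intervals(perm):
    """Common intervals of perm and the identity (signs ignored), as frozensets."""
    values = [abs(x) for x in perm]
    found = set()
    for i in range(len(values)):
        lo = hi = values[i]
        for j in range(i, len(values)):
            lo, hi = min(lo, values[j]), max(hi, values[j])
            if hi - lo == j - i:
                found.add(frozenset(range(lo, hi + 1)))
    return found


def commute(a, b):
    return a <= b or b <= a or not (a & b)


def strong_intervals(perm):
    """Common intervals that commute with every common interval."""
    intervals = common_intervals(perm)
    return {a for a in intervals if all(commute(a, b) for b in intervals)}


def children(node, strong, perm):
    """Maximal strong intervals strictly inside `node`, from left to right."""
    position = {abs(x): i for i, x in enumerate(perm)}
    inside = [s for s in strong if s < node]
    maximal = [s for s in inside if not any(s < t for t in inside)]
    return sorted(maximal, key=lambda s: position[min(s)])


sigma = [1, 8, 4, 2, 5, 3, 9, 6, 7, 12, 10, 14, 13, 11, 15, 17, 16, 18]
strong = strong_intervals(sigma)
root = frozenset(range(1, 19))
print([sorted(c) for c in children(root, strong, sigma)])
```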

Let $I$ be a strong interval of $\sigma$ that is not a singleton and let $\mathcal{I} = (I_1, \ldots, I_k)$ be the unique partition
of the elements of $I$ into maximal strong intervals, from left to right. The quotient permutation of $I$,
denoted $\sigma_I$, is the permutation of size $k$ defined as follows: $\sigma_I(i)$ is smaller than $\sigma_I(j)$ in $\sigma_I$ if and only
if any element of the content of $I_i$ is smaller than any element of the content of $I_j$. A fundamental
property of the strong interval tree is that the quotient permutation $\sigma_I$ of an internal vertex $I$ having $k$
children ($k \ge 2$) in a strong interval tree can only be either $\mathrm{Id}_k$, $\mathrm{Id}^m_k$, or a permutation of size $k$, with
$k \ge 4$, whose only common intervals are the $k + 1$ trivial common intervals. Such a permutation with
no non-trivial common interval is called a simple permutation. The shortest simple permutations are of
size 4 and are $[3\ 1\ 4\ 2]$ and $[2\ 4\ 1\ 3]$. We describe simple permutations in more detail in Section 3.1.
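The claim about size 4 can be checked by enumeration. The following sketch tests simplicity directly from the definition; the name `is_simple` is an assumption of this sketch. Note that the test also accepts permutations of size 1 and 2, which have only trivial common intervals, whereas the text reserves the term for sizes at least 4.

```python
from itertools import permutations


def is_simple(perm):
    """True if no proper segment of length >= 2 holds consecutive values."""
    n = len(perm)
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i + 1, n):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            # A proper segment (shorter than n) with consecutive values
            # is a non-trivial common interval.
            if hi - lo == j - i and j - i + 1 < n:
                return False
    return True


simple_3 = [p for p in permutations(range(1, 4)) if is_simple(p)]
simple_4 = [p for p in permutations(range(1, 5)) if is_simple(p)]
print(simple_3, simple_4)
```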

For an internal vertex $I$, if $\sigma_I = \mathrm{Id}_k$ (resp. $\sigma_I = \mathrm{Id}^m_k$, $\sigma_I$ is a simple permutation), then $I$ is said to be an
increasing linear vertex (resp. decreasing linear vertex, prime vertex). Another crucial property of a
strong interval tree is that no two increasing (resp. decreasing) linear vertices can be adjacent: if a linear
vertex is the child of another linear vertex, then one of them is increasing and the other one is decreasing.

The strong interval tree is also known as the substitution decomposition tree [T] , and is very similar to 
PQ-trees [5], a data structure used to represent the common intervals of two or more permutations [501 
[TlIU]. More precisely, the strong interval tree defines a PQ-tree if linear (resp. prime) vertices are called 
Q-vertices (resp. P-vertices). This PQ-tree can be computed in linear time [7]. To obtain the strong 
interval tree, the quotient permutation of each vertex needs then to be computed. The algorithm of [7] 
can be adapted to compute them, still in linear time. Indeed, given the tree, the quotient permutations 
can be computed as follows: consider the elements on the leaves, from 1 to n, and propagate these 
elements along the edges of the tree towards the root, until a previously used edge is encountered. The 
relative ordering of the elements at every internal vertex of the tree gives the quotient permutation, and 
their computation is obtained in $O(n)$ time.



The strong interval tree as a guide for perfect sorting by reversals. The algorithm in [?] 
computing a parsimonious perfect scenario for a given signed permutation is the central object of study 
here, and is henceforth labeled Algorithm BBCP07. 

To compute a parsimonious perfect scenario for a signed permutation $\sigma$, Algorithm BBCP07 heavily
relies on the strong interval tree $T(\sigma)$ of $\sigma$. It starts with computing this tree, and then assigns signs to
internal vertices according to the following rules: an increasing (resp. decreasing) linear vertex is signed
+ (resp. −) and a prime vertex having a linear parent inherits its sign from its parent. Some prime
vertices may remain unsigned at this step, and the algorithm will explore all the possible assignments
of signs to these prime nodes. If $p$ denotes the number of prime vertices in $T(\sigma)$, there may be up to $2^p$
possible assignments. The key ingredient of the algorithm is that any reversal in a perfect scenario is
either a strong interval (hence a vertex of $T(\sigma)$) or the union of consecutive children of a prime vertex of
$T(\sigma)$ [4, Proposition 2]. Hence a scenario can be computed by looking successively at each vertex of the
strong interval tree, sorting a signed permutation, defined from its quotient permutation and the signs of
its children, towards either the identity (if the vertex has sign +) or the reversed identity (if the vertex
has sign −). More precisely, for each assignment of signs, a scenario is computed as follows:

• Transform the quotient permutation of each vertex into a signed permutation by lifting the sign of
each child onto the corresponding element in the quotient permutation.

• For each prime node signed + (resp. −) whose signed quotient permutation is $\tau$ (a signed
permutation of size $k$), compute a parsimonious scenario from $\tau$ to $\mathrm{Id}_k$ (resp. to $\overline{\mathrm{Id}}_k$). This is achieved using
a polynomial-time algorithm solving the general sorting by reversals problem (without the
'perfectness' condition). The most efficient algorithm so far is the one of [30], which runs in $O(k\sqrt{k \log k})$
time.

• In addition to the reversals obtained at the previous step, perform a reversal for every interval of
$\sigma$ that corresponds to a vertex (internal vertex or leaf) in $T(\sigma)$ whose parent is linear and whose
sign is different from the sign of its parent.

The scenarios thus obtained are all perfect scenarios, and among them, those of minimal length are
parsimonious perfect scenarios. For the correctness and complexity analysis of Algorithm BBCP07, we
refer to [3].

Example 5. On the example of Figure 1, the root of $T(\sigma)$, its two prime children and vertex {6, 7} are
signed +, whereas vertices {13,14} and {16,17} are signed −. For vertex {2,3,4,5}, the two possible
sign assignments have to be tested. Choosing sign + (resp. −) produces a scenario with 15 (resp. 14)
reversals, among which 4 correct a sign mismatch between a vertex and its linear parent (for vertices
{6}, {13}, {16} and {16, 17}) and the remaining 11 (resp. 10) arise from reversals in prime nodes. More
precisely, sorting the right-most prime child of the root requires 3 reversals (through the optimal scenario
[3 1 4 2] → [4 1 3 2] → [1 4 3 2] → [1 2 3 4]); when sign + is chosen, the left-most prime child of the root
is sorted in 4 reversals ([3 1 4 2] → [1 3 4 2] → [4 3 1 2] → [2 1 3 4] → [1 2 3 4]) and its prime child in 4
reversals ([3 1 4 2] → [3 1 2 4] → [1 3 2 4] → [1 2 3 4] → [1 2 3 4]); and when sign − is chosen, the left-most
prime child of the root is sorted in 3 reversals ([3 1 4 2] → [3 4 1 2] → [3 2 1 4] → [1 2 3 4]) and its prime
child in 4 reversals ([3 1 4 2] → [3 4 1 2] → [4 3 1 2] → [4 3 2 1] → [4 3 2 1]). Therefore, for the signed
permutation $\sigma$ of Figure 1, the length of a parsimonious perfect scenario is 14.

The following proposition is a summary of some of the key results of [3] on Algorithm BBCP07, that 
will play a central role in our work. 

Proposition 6 (Bérard et al. [4]). Let $\sigma$ be a signed permutation of size $n$. Let $T(\sigma)$ be its strong
interval tree, and denote by $p$ its number of prime nodes. Then the following statements hold:

1. Algorithm BBCP07 computes a parsimonious perfect scenario for $\sigma$ in worst-case time $O(2^p\, n\sqrt{n \log n})$;

2. $\sigma$ is a commuting permutation if and only if $p = 0$;

3. if $\sigma$ is a commuting permutation, then a sorting scenario for $\sigma$ is perfect if and only if it consists
of one reversal for every interval corresponding to a vertex of $T(\sigma)$ that has a sign different from
its parent.

Hence it appears that prime vertices of the strong interval tree are fundamental in the exponential
worst-case behavior of Algorithm BBCP07, and more generally in the hardness of the problem of perfect
sorting by reversals. Indeed, an interpretation of the hardness result given in [16] in terms of strong
interval trees is that perfect sorting by reversals is NP-complete for signed permutations whose strong
interval tree contains only prime nodes.
3 On the number of prime vertices



As we shall soon see, the average-time complexity of Algorithm BBCP07 can also be bounded with 
the aid of strong interval trees. We use enumerative results on simple permutations to determine the 
"average shape" of a tree with n leaves. This average shape is extremely simple and has a single prime 
node. From this we can easily bound the average-time complexity. 

3.1 Combinatorial preliminaries: strong interval trees and simple permutations



The following formal description of the underlying structure of the strong interval trees is useful for our 
enumerative analysis. 

Definition 7. Let $\mathcal{T}_n$ be the family of plane trees satisfying the following properties:

P1. each tree has $n$ leaves ($n$ is the size of the trees of $\mathcal{T}_n$);

P2. each leaf is labeled by + or −;

P3. the children of each internal vertex are totally ordered;

P4. each internal vertex has at least two children;

P5. if an internal vertex has $k$ children, it is labeled either by $\mathrm{Id}_k$, or $\mathrm{Id}^m_k$, or a simple permutation of
size $k$ if $k \ge 4$;

P6. no edge is incident to two vertices labeled by $\mathrm{Id}$ or two vertices labeled by $\mathrm{Id}^m$.

We previously noted that each permutation corresponds to a strong interval tree. We prove next that 
this correspondence is bijective. 

Theorem 8. There is a bijection between the set of signed permutations of size $n$ and $\mathcal{T}_n$.

Proof. First, it is immediate to see that a unique tree of $\mathcal{T}_n$ can be obtained from a signed permutation
$\sigma$ of size $n$. Indeed, it is enough to modify its strong interval tree $T(\sigma)$ by labeling each leaf representing
an element of $\sigma$ by its sign, and each internal vertex corresponding to a strong interval $I$ by the quotient
permutation $\sigma_I$.

To get a signed permutation $\sigma_T$ from a tree $T$ of $\mathcal{T}_n$, we assign signed integers to its leaves and $\sigma_T$
will be obtained by reading the leaves from left to right. The absolute values of the integers labeling the
leaves are obtained by a top-down approach. We first assign the set of integers $I = \{1, \ldots, n\}$ to the root,
together with a variable $m$ set to 1 indicating the minimal value of $I$. We propagate this assignment from
the root to the leaves as follows. Consider a node labeled by a permutation $\tau$ with $k$ children rooting
subtrees of sizes $s_1, \ldots, s_k$ from left to right, that has been assigned the set $I$ of consecutive integers and
the variable $m = \min(I)$. Then assign sets $I_1, \ldots, I_k$ and variables $m_1, \ldots, m_k$ to its children so that
$m_i = m + \sum_{j : \tau(j) < \tau(i)} s_j$ and $I_i = \{m_i, \ldots, m_i + s_i - 1\}$. At the end of this process, every leaf is labeled
by an integer $m$ and a set $I = \{m\}$. The signed integer assigned to such a leaf is then either $m$ if the
leaf has label + in $T$ or $-m$ if it has label −. Notice that the sets $I$ assigned to the nodes of $T$ actually
correspond to the strong intervals of $\sigma_T$, ensuring that the above mapping is a bijection. □

Recall that simple permutations are the permutations that have no non-trivial common interval, and
are used here as quotient permutations of prime nodes. The enumeration of simple permutations was
investigated in [2]. The authors prove that this enumerative sequence is not P-recursive and there is no
known closed formula for the number of simple permutations of a given size. Nonetheless they are able
to compute a complete asymptotic expression for the number of simple permutations of size $n$.

Theorem 9 (Albert et al. [2]). Let $s_n$ be the number of simple permutations of size $n$. Then

$$s_n = \frac{n!}{e^2}\left(1 - \frac{4}{n} + \frac{2}{n(n-1)} + O(n^{-3})\right) \qquad (1)$$

when $n \to \infty$.
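A quick numerical comparison gives a feel for how fast the asymptotic regime sets in. The sketch below counts simple permutations by brute force for sizes 4 to 8 and compares them with the estimate $\frac{n!}{e^2}\left(1 - \frac{4}{n} + \frac{2}{n(n-1)}\right)$, which is the classical expansion of Albert, Atkinson and Klazar truncated before its $O(n^{-3})$ term. Function names are assumptions of this sketch, and the enumeration is only practical for small sizes.

```python
from itertools import permutations
from math import e, factorial


def is_simple(perm):
    """True if no proper segment of length >= 2 holds consecutive values."""
    n = len(perm)
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i + 1, n):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            if hi - lo == j - i and j - i + 1 < n:
                return False
    return True


def count_simple(n):
    """Number of simple permutations of size n, by exhaustive enumeration."""
    return sum(1 for p in permutations(range(1, n + 1)) if is_simple(p))


def estimate(n):
    """Truncated asymptotic estimate, without the O(n^-3) correction."""
    return factorial(n) / e ** 2 * (1 - 4 / n + 2 / (n * (n - 1)))


for n in range(4, 9):
    print(n, count_simple(n), round(estimate(n), 1))
```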
3.2 Average shape of strong interval trees



A twin in a strong interval tree is a vertex of degree 2 such that each of its two children is a leaf. Thus, 
a twin is a linear vertex. 

Let us notice that all results in this section apply both to signed permutations and unsigned
permutations: the two main reasons for this are that the definition of intervals in a permutation ignores the
signs of the elements, and that $2^n$ signed permutations are associated to any unsigned permutation $\sigma$ of
size $n$, and this number does not depend on $\sigma$.

We first state the main result of this section. 

Theorem 10. Asymptotically, with probability 1, a random permutation of size n has a strong interval 
tree of the form: 

• the root is a prime vertex; 

• every child of the root is either a leaf or a twin. 

Moreover, the probability distribution of the number $k$ of twins is given by $P(k) = \frac{2^k}{e^2\, k!}$. Consequently,
the expected number of twins is 2.

Before proving this result, we can notice that it overlaps with previous results on the expected number
of common intervals in permutations. In their paper introducing the problem of computing the common
intervals of a permutation [32], Uno and Yagiura showed that the expected number of common intervals
of length 2 in a permutation is $2 - 2/n$ and that, for all $\ell > 2$, the expected number of common intervals
of size $\ell$ is for $n$ large enough. This implies immediately our result on the shape of the strong interval
tree. Later, Corteel et al. showed in [12] that the probability distribution of the number of common
intervals of size 2 follows a Poisson law, with mean 2, a result already proved by Kaplansky, in relation
with runs in permutations [22]. A similar result was also proved independently in [34]. Theorem 10
gathers all these results together, expressed in terms of the strong interval tree. Moreover, the proof we
give here is new and relies on enumerative results on simple permutations.
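The value $2 - 2/n$ for common intervals of length 2 can be verified exactly on small sizes. A common interval of length 2 is a pair of adjacent positions holding consecutive values, so the sketch below simply averages the number of such adjacencies over all permutations of size n, using exact rational arithmetic. Function names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def adjacencies(perm):
    """Number of adjacent positions holding consecutive values."""
    return sum(1 for a, b in zip(perm, perm[1:]) if abs(a - b) == 1)


def expected_adjacencies(n):
    """Exact average number of common intervals of length 2 over size-n permutations."""
    total = sum(adjacencies(p) for p in permutations(range(1, n + 1)))
    return Fraction(total, factorial(n))


for n in range(2, 8):
    print(n, expected_adjacencies(n), 2 - Fraction(2, n))
```

The agreement follows from linearity of expectation: each of the $n - 1$ adjacent pairs of positions holds consecutive values with probability $2/n$.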

The proof of Theorem 10 follows from Lemma 11 below and Theorem 9.

Lemma 11. If $p_{n,k}$ denotes the number of unsigned$^1$ permutations of size $n$ which contain a common
interval $I$ of length $k$, then for any fixed positive integer $c$:

$$\sum_{k=c+2}^{n-c} \frac{p_{n,k}}{n!} = O(n^{-c}).$$

Proof. The proof of Lemma 11 is essentially identical to that of Lemma 7 of [2]. We have
$p_{n,k} \le (n-k+1)\,k!\,(n-k+1)!$. Indeed, the right-hand side counts the number of quotient permutations corresponding
to $I$ (which is $k!$), the possible values of the minimal element of $I$ ($n-k+1$) and the structure of the
rest of the permutation with one more element for the insertion of $I$ ($(n-k+1)!$). Only the extremal
terms of the sum can have magnitude $O(n^{-c})$ and the remaining terms have magnitude $O(n^{-c-1})$. Since
there are fewer than $n$ terms, the result of Lemma 11 follows. □
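The behaviour of the upper bound used in this proof can be observed numerically. The sketch below evaluates $\sum_k (n-k+1)\,k!\,(n-k+1)!/n!$ exactly and rescales it by $n^c$; the summation range $k = c+2, \ldots, n-c$ is an assumption of this sketch, chosen because it matches the use of the lemma with $c = 1$ for intervals of size 3 to $n - 1$. Function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def bound_sum(n, c):
    """Exact value of the sum over k = c+2 .. n-c of (n-k+1) * k! * (n-k+1)! / n!."""
    total = Fraction(0)
    for k in range(c + 2, n - c + 1):
        total += Fraction((n - k + 1) * factorial(k) * factorial(n - k + 1), factorial(n))
    return total


def scaled(n, c):
    """The bound multiplied by n^c; it stays bounded if the sum is O(n^-c)."""
    return float(bound_sum(n, c) * n ** c)


for n in (25, 50, 100, 200):
    print(n, round(scaled(n, 1), 3), round(scaled(n, 2), 3))
```

For $c = 1$ the two extremal terms, $k = 3$ and $k = n - 1$, contribute about $6/n$ and $4/n$ respectively, which is why the rescaled value settles near 10.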

Proof of Theorem 10. Lemma 11 with $c = 1$ gives that the proportion of non-simple permutations with
at least one common interval of size greater than or equal to 3 is $O(n^{-1})$. But permutations whose
common intervals are only of size 1, 2 or $n$ are exactly permutations whose strong interval tree has a
prime root and every child is either a leaf or a twin.

Similarly, the number of permutations whose strong interval tree has the form of a prime root with $k$
twins is $s_{n-k}\binom{n-k}{k}2^k$. Given the asymptotic estimate of $s_n$ in Equation (1), we compute the asymptotic
estimate for the number of such permutations to be $\frac{n!\,2^k}{e^2\,k!}$, proving Theorem 10. □

This result has an immediate corollary in terms of perfect sorting by reversals: the probability that
a signed permutation corresponds to an instance that requires an exponential time computation to be
solved tends to 0 as $n$ grows.

Corollary 12. Algorithm BBCP07 runs in $O(n\sqrt{n \log n})$ time with probability 1 as $n \to \infty$.

$^1$ For signed permutations, the denominator $n!$ should be replaced by $2^n n!$.
3.3 Average time complexity of perfect sorting by reversals

Further analysis of the tree family $\mathcal{T}_n$ yields a polynomial bound on the average-time complexity of
Algorithm BBCP07 (Theorem 15).

Consider the following sum, which is central in the description of the complexity of the algorithm:

$$P_n = \sum_{p=0}^{n} \frac{T_{n,p}}{T_n}\, 2^p.$$

Here $T_n$ is the number of strong interval trees with $n$ leaves ($T_n = |\mathcal{T}_n| = 2^n n!$ from Theorem 8) and
$T_{n,p}$ is the number of such trees with $p$ prime vertices. The key step in the algorithm complexity result
is essentially reduced to showing $P_n \in O(1)$.

As an intermediate step, we find a bound on $U_{n,p}$, the number of unsigned permutations of size $n$
whose strong interval trees contain $p$ prime vertices, when $p \ge 2$.

Lemma 13. The number $U_{n,p}$ of unsigned permutations of size $n$ whose strong interval trees contain $p$
prime vertices with $p \ge 2$ is at most $48\,\frac{(n-1)!}{2^p}$.



Proof. We proceed by induction on the number $p$ of prime vertices. The hypothesis is the following:

$(H_p)$: $\forall n,\ U_{n,p} \le 48\,\frac{(n-1)!}{2^p}$.

The hypothesis $(H_p)$ is trivially true for $n < 3p + 1$, since a tree containing $p$ prime vertices has at least
$3p + 1$ leaves. We initiate the proof with $p = 2$, assuming $n \ge 7$. A tree of size $n$ with two prime vertices
can always be decomposed, although not uniquely, as a tree $T_1$ that contains one prime vertex, where
one leaf is chosen and expanded by a second tree $T_2$ with one prime vertex. Hence $|T_1| + |T_2| = n + 1$.
Without loss of generality, one can assume that the root of $T_2$ is its only prime vertex. Recall that the
number of trees with one prime vertex with $k$ leaves is at most $k!$, as such trees are in bijection with a
subset of unsigned permutations of size $k$. Hence,



< 
4 ) k=4

24(n-l)! (n-3)(n-2) 



(n- l)(n-2) 



< 48 



fc=0 

(n-1)! 



Let us now suppose $(H_p)$ true and prove $(H_{p+1})$. We proceed as before. Indeed, a tree with $p + 1$
prime vertices can be decomposed, not necessarily uniquely, as a tree $T_1$ with $p$ prime vertices, one
leaf of which is expanded by another tree $T_2$ with one prime vertex. As explained before, we can assume
that $n \ge 3(p + 1) + 1$. Hence:



U, 



n,p+l 
A straightforward analysis by successive derivations on n shows that



3-2) 



2w(ra+i) < for all n > 10. Hence, since p > 2, we deduce that 



< 



i 



2n(n+l) 



for all $n \ge 3(p + 1) + 1$. This ensures that $U_{n,p+1} \le 48\,\frac{(n-1)!}{2^{p+1}}$ and concludes the proof. □

In the context of signed permutations, Lemma 13 immediately yields the following result:

Lemma 14. The number $T_{n,p}$ of signed permutations of size $n$ whose strong interval trees contain $p$
prime vertices with $p \ge 2$ is at most $2^n\, 48\,\frac{(n-1)!}{2^p}$.

Theorem 15. Computing a shortest perfect scenario for a random signed permutation can be done with
average time complexity bounded by $O(n\sqrt{n \log n})$.

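Proof. First we bound $P_n$. For all $n$, by Lemma 14,

$$P_n = \frac{1}{T_n}\sum_{p=0}^{n} 2^p\, T_{n,p} \le \frac{1}{T_n}\left(T_{n,0} + 2\,T_{n,1} + \sum_{p=2}^{n} 2^p \cdot 2^n\, 48\,\frac{(n-1)!}{2^p}\right) \le 3 + \sum_{p=2}^{n} \frac{48}{n} = 3 + 48\left(1 - \frac{1}{n}\right).$$

Thus, $P_n \in O(1)$. The average time complexity of Algorithm BBCP07 for permutations of size $n$ is given
by the following sum, for some constant $C$:

$$\frac{C}{T_n}\sum_{p=0}^{n} T_{n,p}\, 2^p\, n\sqrt{n \log n} = C\, P_n\, n\sqrt{n \log n}.$$

The result follows since $P_n \in O(1)$.

□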
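The quantity $P_n$ can be computed exactly for small sizes, which shows how far the constant in the bound is from the actual values. Since signs do not affect the strong interval tree, $P_n$ equals the average of $2^{p(\sigma)}$ over unsigned permutations $\sigma$ of size $n$, where $p(\sigma)$ is the number of prime vertices. The sketch below is a brute-force illustration with assumed function names: a vertex is counted as prime when its children, read from left to right, are not in monotone order of values.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def prime_vertices(perm):
    """Number of prime vertices in the strong interval tree of an unsigned permutation."""
    n = len(perm)
    pos = {v: i for i, v in enumerate(perm)}
    # Common intervals, stored as (min value, max value).
    ivs = set()
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i, n):
            lo, hi = min(lo, perm[j]), max(hi, perm[j])
            if hi - lo == j - i:
                ivs.add((lo, hi))

    def inside(a, b):
        return b[0] <= a[0] and a[1] <= b[1]

    def commute(a, b):
        return a[1] < b[0] or b[1] < a[0] or inside(a, b) or inside(b, a)

    strong = [a for a in ivs if all(commute(a, b) for b in ivs)]
    count = 0
    for node in strong:
        if node[0] == node[1]:
            continue  # leaves are not internal vertices
        below = [s for s in strong if s != node and inside(s, node)]
        kids = [s for s in below if not any(t != s and inside(s, t) for t in below)]
        kids.sort(key=lambda s: pos[s[0]])  # left-to-right order in perm
        mins = [s[0] for s in kids]
        # Linear vertices have children in increasing or decreasing order.
        if mins != sorted(mins) and mins != sorted(mins, reverse=True):
            count += 1
    return count


def P(n):
    """Exact value of the average of 2^p over all permutations of size n."""
    total = sum(2 ** prime_vertices(p) for p in permutations(range(1, n + 1)))
    return Fraction(total, factorial(n))


for n in range(3, 7):
    print(n, P(n), float(P(n)))
```

For instance, only the two simple permutations of size 4 have a prime vertex, so $P_4 = (22 + 2 \cdot 2)/24 = 13/12$.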

4 Properties of commuting scenarios 

We observed in the previous section that the typical shape of the common interval tree associated to 
a random permutation is very particular, and it is reasonable to ask if a signed permutation selected 
uniformly at random adequately represents the expected shape of an evolutionary scenario. Experi- 
mentally, the strong interval trees that arise when comparing pairs of mammalian genomes contain few 
prime nodes, labeled with small, simple permutations. Rather, they contain large subtrees with no prime 
nodes. These subtrees represent commuting scenarios. At present we are unaware of a weighting operator 
on signed permutations which correlates to the probability that such a permutation could represent an 
evolutionary scenario on real data. Indeed, such an operator would greatly aid in determining realistic 
run-times for algorithms on biological data and other properties of evolutionary scenarios. Towards this 
goal we begin by investigating the class of strong interval trees with no prime nodes. These correspond 
to commuting scenarios. 

The trees that represent commuting scenarios are particularly well-studied. They fall into the category 
of simple varieties of trees, and as such, many formulas exist to compute quantities such as the asymptotic 
number of trees with n leaves, and also distributions associated to various tree parameters. Some of these 
parameters have direct relevance to the evolutionary scenario interpretation. Section VII.3 of [17]
is a pedagogical reference for simple varieties of trees and we outline how to derive some key values here.

In the remainder of the section, we prove the following results on parsimonious perfect scenarios 
sorting a commuting signed permutation of size n, via common interval trees: 

1. The asymptotic number of commuting permutations is $2^{n+1} \cdot 0.07\,(5.83)^n\, n^{-3/2}$ (a very typical
expression for trees) (Equation (6));

2. The average number of reversals in one of these scenarios is $1.2\,n$ (Theorem 16). This is a
consequence of the average number of internal vertices in the tree (Equation (8));

3. The average length of a reversal is $1.05\,\sqrt{n}$. This is related to the average pathlength of a tree
(Theorem 17).

Additionally, in the proof of Theorem [T21 we can determine that 37% of the expected reversals have 
length 1. This agrees with the observation of a large proportion of short reversals in parsimonious 
scenarios for bacterial genomes [23] . 

Finally, a note on convergence. The asymptotic estimates we present converge quickly, even for 
relatively small n. For example, the estimate given for the number of commuting permutations is correct 
up to order $O(n^{-5/2})$. In real terms, at $n = 100$ it is within 3% of the real value. The parameters have
a similar accuracy. The trees that arise from biological data have on the order of 1000 leaves (see [23] 
for example), and hence these are very strong estimates. 

4.1 Modified Schröder Trees

Let $\sigma$ be a commuting permutation of size $n$, equivalently, a signed permutation whose strong interval
tree $T(\sigma)$ has no prime node. Thus, $T(\sigma)$ is a plane tree with the property that the internal vertices have
at least two children, each leaf is signed either $+$ or $-$, and the root is also signed $+$ or $-$ (to indicate
whether it is an increasing or a decreasing linear vertex). The signs of the other internal vertices follow
unambiguously from the sign of the root, alternating between $+$ and $-$ along each branch of the tree.

Disregarding the signs on the leaves and root, this family of trees is known as Schröder trees (entry
A001003 in the On-Line Encyclopedia of Integer Sequences [28]), and they are straightforward to analyze.

Let $\mathcal{C}$ be the class of all strong interval trees representing commuting permutations, and let $\mathcal{S}$ be the
class of Schröder trees. If $C_n$ and $S_n$ respectively denote the number of trees with $n$ leaves in these two
classes, then

\[ C_n = 2 \cdot 2^n \cdot S_n. \]

Because of this exact $1 : 2^{n+1}$ correspondence, we generally first consider the class $\mathcal{S}$ to determine
structural properties, and then account for the contribution from the leaves. We remark that $\mathcal{S}$ is a
subset of the trees T.

4.2 A specification for S 

Like many tree classes, there is a simple recursive description for the class $\mathcal{S}$ of Schröder trees: a tree is
either a leaf (denoted $\mathcal{Z}$), or an internal vertex with at least two subtrees, all of which are elements of
$\mathcal{S}$. A visual representation of this statement is given in Figure 2.

Figure 2: A Schröder tree is decomposed as either a leaf, or an internal vertex with a sequence of subtrees.

We say that the size of a tree $\tau \in \mathcal{S}$ is the number of leaves, and we denote this quantity by $|\tau|$. We
shall later consider the number of internal vertices. A leaf is an atomic structure of weight one, and an
internal vertex is a neutral structure of weight 0. We translate the above pictorial description of $\mathcal{S}$ into
the following combinatorial equation:

\[ \mathcal{S} = \mathcal{Z} + \mathrm{Seq}_{\ge 2}(\mathcal{S}), \qquad (2) \]

where $\mathrm{Seq}_{\ge 2}(\mathcal{S})$ represents a sequence (total order) of at least two trees of $\mathcal{S}$.
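The recursive specification can be turned directly into an exhaustive generator. The Python sketch below is only an illustration: the nested-tuple encoding (a leaf is the empty tuple, an internal vertex is the tuple of its children) and the function names `trees` and `forests` are assumptions made for this example and are not part of the original analysis.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def forests(n, min_parts):
    """All sequences of at least `min_parts` Schroeder trees with n leaves in total."""
    if n == 0:
        return ((),) if min_parts <= 0 else ()
    result = []
    # The first tree takes k leaves; keep at least one leaf per remaining required part.
    for k in range(1, n - max(min_parts - 1, 0) + 1):
        for first in trees(k):
            for rest in forests(n - k, min_parts - 1):
                result.append((first,) + rest)
    return tuple(result)


def trees(n):
    """All Schroeder trees with n leaves, following S = Z + Seq_{>=2}(S)."""
    if n == 1:
        return ((),)          # a single leaf
    return forests(n, 2)      # an internal vertex is a sequence of >= 2 subtrees


counts = [len(trees(n)) for n in range(1, 8)]
print(counts)
```

The list of counts can be compared with the first terms of OEIS entry A001003. Exhaustive generation is only practical for small sizes, since the number of trees grows exponentially.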
4.3 A crash course on decomposable structures

The formalism we are using here is well described in [17]. The main advantage is the direct access to
functional equations for the ordinary generating functions. Recall, if $S_n$ is the number of trees in $\mathcal{S}$ with
$n$ leaves, the ordinary generating function (ogf) is defined as the formal power series $S(z) = \sum_n S_n z^n$.
Thus, the series expansion of $S(z)$ begins $S(z) = z + z^2 + 3z^3 + 11z^4 + \cdots$. Many combinatorial actions
described on combinatorial classes have companion actions on their generating functions. To summarize,
suppose that $\mathcal{A}$ is a combinatorial class with some notion of size, and that $A_n$ is the number of objects
of size $n$. Let $A(z)$ be the series $\sum_n A_n z^n$. If we can express $\mathcal{A}$ in terms of other combinatorial classes,
then we can do likewise for its ogf. For example, if $\mathcal{A}$ is the disjoint union of two classes, $\mathcal{A} = \mathcal{B} \uplus \mathcal{C}$, the
associated generating functions satisfy the simple relation $A(z) = B(z) + C(z)$. If $\mathcal{A}$ is described using
the cartesian product and the size is additive, $\mathcal{A} = \mathcal{B} \times \mathcal{C} = \{(\beta, \gamma) : \beta \in \mathcal{B}, \gamma \in \mathcal{C}\}$, then the ogfs satisfy
$A(z) = B(z)\,C(z)$, the usual product for formal power series. Finally, if class $\mathcal{A}$ is a sequence of objects
from class $\mathcal{B}$, that is, $\mathcal{A} = \mathrm{Seq}(\mathcal{B}) = \{(\beta_1, \ldots, \beta_k) : 0 \le k,\ \beta_i \in \mathcal{B}\}$, then there is the generating function
correspondence

\[ A(z) = \frac{1}{1 - B(z)}. \]

This is the mere surface of a vast theory rooted in the foundational work of Chomsky and Schützenberger
and their study of algebraic equations related to context-free grammars, but significantly advanced and
summarized as the theory of decomposable structures in [17].

This is a particularly robust formalism: we can create recursive functional equations, and we can 
easily pass information about additional parameters. We do both of these here. 
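The dictionary between constructions and generating functions can be mimicked with truncated power series. The following Python sketch is an assumption-laden illustration rather than the authors' implementation: series are plain coefficient lists, the helper names `mul`, `seq` and `schroeder_series` are hypothetical, and the recursive equation is solved by naive fixed-point iteration, using the identity Seq≥2(S) = S × S × Seq(S).

```python
def mul(a, b, order):
    """Product of two truncated power series (coefficient lists of length order + 1)."""
    c = [0] * (order + 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j in range(order + 1 - i):
            c[i + j] += ai * b[j]
    return c


def seq(b, order):
    """Series of Seq(B), i.e. 1 / (1 - B(z)); requires b[0] == 0."""
    if b[0] != 0:
        raise ValueError("Seq(B) needs B to have no object of size 0")
    a = [1] + [0] * order
    # A = 1 + B * A, so a[n] = sum_{k=1..n} b[k] * a[n - k].
    for n in range(1, order + 1):
        a[n] = sum(b[k] * a[n - k] for k in range(1, n + 1))
    return a


def schroeder_series(order):
    """Coefficients [S_0, ..., S_order] of S(z), from S = z + S * S * Seq(S); order >= 1."""
    s = [0] * (order + 1)
    # Each pass fixes at least one more coefficient, so order + 1 passes suffice.
    for _ in range(order + 1):
        s = mul(mul(s, s, order), seq(s, order), order)
        s[1] += 1  # the atom Z contributes the term z
    return s


print(schroeder_series(6))  # [0, 1, 1, 3, 11, 45, 197]
```

The iteration is quadratic per pass and is meant for checking small cases; the closed form and singularity analysis that follow are what give access to large sizes.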

4.4 Enumeration formulas 

We easily translate the combinatorial description in Eq. (2) into the functional equation²

\[ S(z) = z + \frac{S(z)^2}{1 - S(z)}. \]

This converts to a simple quadratic equation in $S(z)$. There are two solutions and we choose the one
with a Taylor series expansion at 0 with positive integer coefficients, i.e. a generating function solution.
This is

\[ S(z) = \frac{1 + z}{4} - \frac{1}{4}\sqrt{\Big(1 - \frac{z}{3+\sqrt{8}}\Big)\Big(1 - \frac{z}{3-\sqrt{8}}\Big)}. \]

In order to determine expressions for the asymptotic growth, we follow exactly the procedure outlined
in [17, Chapter VI.1], in particular the flow chart of [17, Figure VI.7]. We outline the three main steps
of the analysis, but readers interested in further details are referred to this resource.

The first step is to determine the dominant singularity. This is the smallest positive real-valued
singularity, which in this case is $3 - \sqrt{8}$.

The second step is to determine the behavior of the function around its dominant singularity, $3 - \sqrt{8}$:

\[ S(z) = \frac{4 - \sqrt{8}}{4} - \frac{1}{2}\sqrt{\sqrt{18} - 4}\,\Big(1 - \frac{z}{3 - \sqrt{8}}\Big)^{1/2} + O\Big(1 - \frac{z}{3 - \sqrt{8}}\Big) \quad \text{as } z \to 3 - \sqrt{8}. \qquad (5) \]

We are in a context where asymptotic transfer theorems [17, VI.3] apply, and hence we move to the
final step. The approximation of the function near this singularity yields an asymptotic approximation
for its coefficients in the Taylor expansion around 0. Roughly, we adapt the following correspondence:

\[ F(z) \sim \Big(1 - \frac{z}{\rho}\Big)^{-\alpha} \text{ as } z \to \rho \quad \Longrightarrow \quad [z^n]F(z) \sim \rho^{-n}\, \frac{n^{\alpha - 1}}{\Gamma(\alpha)}. \]

In this notation, $[z^n]$ extracts the coefficient of $z^n$ in the series expansion of the expression that
immediately follows, and $\Gamma$ is the Gamma function. Using the approximation of $S(z)$ near its dominant
singularity from Eq. (5), we deduce

\[ S_n = [z^n]S(z) \sim \Big(-\frac{1}{2}\sqrt{\sqrt{18} - 4}\Big)\,(3 - \sqrt{8})^{-n}\, \frac{n^{-3/2}}{\Gamma(-1/2)} \approx 0.07\,(5.83)^n\, n^{-3/2}. \qquad (6) \]

² Here we have used that $\mathrm{Seq}_{\ge 2}(\mathcal{S}) = \mathcal{S} \times \mathcal{S} \times \mathrm{Seq}(\mathcal{S})$.

An asymptotic approximation of the number $C_n$ of signed commuting permutations of size $n$ is
obtained by multiplying the above equivalent by $2^{n+1}$.
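The quality of the first-order estimate can be checked numerically. The sketch below is a hedged illustration: it computes exact counts from the quadratic $2S^2 - (1+z)S + z = 0$ (obtained by clearing the denominator in the functional equation), and compares them with the estimate of Eq. (6) after evaluating $\Gamma(-1/2) = -2\sqrt{\pi}$, which gives a constant of about 0.07 and a growth rate $3 + \sqrt{8} \approx 5.83$. The function names are hypothetical.

```python
import math


def schroeder_counts(n_max):
    """Exact S_1..S_n_max (list index = number of leaves); n_max >= 1."""
    s = [0] * (n_max + 1)
    s[1] = 1
    for n in range(2, n_max + 1):
        # Coefficient of z^n in 2 S^2 - (1 + z) S + z = 0, using S_0 = 0.
        conv = sum(s[i] * s[n - i] for i in range(1, n))
        s[n] = 2 * conv - s[n - 1]
    return s


def schroeder_estimate(n):
    """First-order estimate from the transfer theorem at rho = 3 - sqrt(8)."""
    constant = 0.25 * math.sqrt((math.sqrt(18) - 4) / math.pi)
    return constant * (3 + math.sqrt(8)) ** n * n ** -1.5


counts = schroeder_counts(100)
for n in (10, 50, 100):
    # The ratio exact / estimate should approach 1 as n grows.
    print(n, counts[n] / schroeder_estimate(n))
```

Multiplying either quantity by $2^{n+1}$ gives the corresponding figures for signed commuting permutations, so the ratio is unchanged.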

4.5 Tree parameters: A primer 

We study the average value of different tree parameters with a common strategy, which we briefly outline
here. Let $\chi : \mathcal{S} \to \mathbb{N}$ be a non-negative integer-valued function that records some combinatorial property
of a Schröder tree, such as the number of internal vertices. The main tool here is the bivariate generating
function

\[ S(z, u) = \sum_{\tau \in \mathcal{S}} u^{\chi(\tau)} z^{|\tau|} = \sum_{k,n} S_{k,n}\, u^k z^n, \]

where $S_{k,n}$ is the number of Schröder trees with $n$ leaves and $\chi$ value equal to $k$. Of course, $S(z) =
S(z, 1)$ and $S_n = \sum_{k \ge 0} S_{k,n}$. Now, if $E_n(\chi)$ is the expected value of $\chi$ over all objects of size $n$ in $\mathcal{S}$,
then by definition

\[ E_n(\chi) = \frac{\sum_{k \ge 0} k\, S_{k,n}}{\sum_{k \ge 0} S_{k,n}}. \]

We have access to this from the bivariate generating function. Remark,

\[ \sum_{k \ge 0} k\, S_{k,n} = [z^n]\, \frac{\partial}{\partial u} S(z, u)\Big|_{u=1}, \]

hence

\[ E_n(\chi) = \frac{[z^n]\, \frac{\partial}{\partial u} S(z, u)\big|_{u=1}}{[z^n]\, S(z)}. \]

The denominator of this expression is calculated in Equation (6) and in our examples the numerator is a
coefficient extraction of an algebraic function of $z$, hence the three steps described in the previous section
apply. Indeed, in our two examples, the dominant singularity is the same as in $S(z)$, namely $3 - \sqrt{8}$.

This is also a robust approach, and upon considering higher derivatives we can obtain higher moments.

4.6 The average number of internal vertices 

We begin by considering the parameter $\chi$ equal to the number of internal vertices in a Schröder tree. We
can augment the specification in Eq. (2) with a neutral marker $\mu$ of weight 0 to tag the internal vertices:

\[ \mathcal{S} = \mathcal{Z} + \mu \times \mathrm{Seq}_{\ge 2}(\mathcal{S}). \qquad (7) \]

We now consider the bivariate generating function $S(z, u)$ where $u$ marks $\mu$, and so counts the total
number of markers. Eq. (7) translates to a functional equation for $S(z, u)$:

\[ S(z, u) = z + u\, \frac{S(z, u)^2}{1 - S(z, u)}. \]

This is solved in a similar manner to $S(z)$:

\[ S(z, u) = \frac{z + 1 - \sqrt{(z + 1)^2 - 4z(u + 1)}}{2(u + 1)} = z + u z^2 + (u + 2u^2) z^3 + \cdots \]

This expression can be differentiated to determine

\[ \frac{\partial}{\partial u} S(z, u)\Big|_{u=1} = \frac{(z - 1)^2 - (z + 1)\sqrt{(z + 1)^2 - 8z}}{8\sqrt{\Big(1 - \frac{z}{3+\sqrt{8}}\Big)\Big(1 - \frac{z}{3-\sqrt{8}}\Big)}}. \]
As we remarked above, the singularity analysis follows as before using the singularity $3 - \sqrt{8}$. Thus,

\[ \frac{\partial}{\partial u} S(z, u)\Big|_{u=1} \sim a \cdot \Big(1 - \frac{z}{3 - \sqrt{8}}\Big)^{-1/2} \quad \text{as } z \to 3 - \sqrt{8}, \]

where the constant $a$ is determined by evaluating the rest of the expression at $z = 3 - \sqrt{8}$. The third step,
applying the transfer theorem, is then performed. To determine the average, we divide this expression by
the asymptotic number of trees, as determined in Equation (6). We simplify the radicals, and obtain

\[ \frac{[z^n]\, \frac{\partial}{\partial u} S(z, u)\big|_{u=1}}{[z^n]\, S(z)} \sim \frac{3 - \sqrt{8}}{\sqrt{18} - 4}\, n = \frac{n}{\sqrt{2}}. \qquad (8) \]
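The linear growth of the average number of internal vertices, $n/\sqrt{2}$, can be compared with exact values. The following Python sketch rests on two identities derived here from the functional equation for $S(z,u)$ and not stated in the text: clearing denominators gives $(u+1)S^2 - (1+z)S + z = 0$, and differentiating in $u$ at $u = 1$ gives $(1+z)D = S^2 + 4SD$ for $D = \partial_u S|_{u=1}$. The function name is hypothetical.

```python
import math


def internal_vertex_average(n_max):
    """Exact average number of internal vertices over Schroeder trees with n leaves.

    s[n] counts trees with n leaves; d[n] is the total number of internal
    vertices over all of them (coefficient of z^n in dS/du at u = 1).
    Returns a list indexed by n (entry 0 is unused).
    """
    s = [0] * (n_max + 1)
    d = [0] * (n_max + 1)
    s[1] = 1
    for n in range(2, n_max + 1):
        square = sum(s[i] * s[n - i] for i in range(1, n))   # [z^n] S^2
        s[n] = 2 * square - s[n - 1]                         # 2 S^2 - (1 + z) S + z = 0
        cross = sum(s[i] * d[n - i] for i in range(1, n))    # [z^n] S * D
        d[n] = square + 4 * cross - d[n - 1]                 # (1 + z) D = S^2 + 4 S D
    return [None] + [d[n] / s[n] for n in range(1, n_max + 1)]


averages = internal_vertex_average(200)
for n in (10, 50, 200):
    # Exact average next to the leading term n / sqrt(2).
    print(n, averages[n], n / math.sqrt(2))
```

For small sizes the values can be checked by hand: with three leaves there are three trees carrying 1, 2 and 2 internal vertices, so the average is 5/3.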

4.7 The average number of reversals

Next we use the average number of internal vertices to count the average number of reversals in a scenario.
An evolutionary scenario is obtained from a tree in $\mathcal{S}$ by signing the root and the leaves. Each internal
vertex, except the root, represents a reversal. A leaf represents a reversal if and only if it has a sign
different from the sign of its parent.

The number of internal vertices is a good first approximation for the number of reversals.
Asymptotically, since the average number of internal vertices is a linear function of $n$, subtracting one to
account for the root does not affect the leading term.

In order to account for the reversals at the leaves, we remark that for any tree in $\mathcal{S}$ of size $n$, we
consider all $2^n$ possible ways of assigning signs to the leaves, and from this symmetry we deduce that on
average this adds $n/2$ reversals, all of length 1.

We put all of these pieces together in the following theorem.

Theorem 16. The asymptotic average number of reversals in a parsimonious perfect scenario of a
random signed commuting permutation of size $n$ is $n/\sqrt{2} + n/2 = \frac{1+\sqrt{2}}{2}\, n$. On average there are $n/2$
reversals of length 1.
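The counting rule behind this theorem can be checked by brute force on small sizes. The Python sketch below is an illustration under stated assumptions: trees are encoded as nested tuples (a leaf is the empty tuple), signs are +1 and -1, and all function names are hypothetical. It applies the two rules from the text: every internal vertex except the root is a reversal, and a leaf is a reversal exactly when its sign differs from the sign of its parent, with internal signs alternating along each branch.

```python
from fractions import Fraction
from itertools import product


def forests(n, min_parts):
    """All sequences of at least `min_parts` trees with n leaves in total."""
    if n == 0:
        return [()] if min_parts <= 0 else []
    out = []
    for k in range(1, n - max(min_parts - 1, 0) + 1):
        for first in trees(k):
            for rest in forests(n - k, min_parts - 1):
                out.append((first,) + rest)
    return out


def trees(n):
    """All Schroeder trees with n leaves as nested tuples."""
    return [()] if n == 1 else forests(n, 2)


def count_reversals(tree, root_sign, leaf_signs):
    """Number of reversals encoded by a signed tree with at least two leaves."""
    signs = iter(leaf_signs)  # leaf signs in left-to-right order

    def visit(vertex, sign, is_root):
        total = 0 if is_root else 1          # non-root internal vertex: one reversal
        for child in vertex:
            if child == ():                  # leaf: reversal iff sign differs from parent
                total += int(next(signs) != sign)
            else:                            # internal child: sign alternates
                total += visit(child, -sign, False)
        return total

    return visit(tree, root_sign, True)


def average_reversals(n):
    """Exact average over all trees, root signs and leaf signs, for n >= 2."""
    total, scenarios = 0, 0
    for tree in trees(n):
        for root_sign in (+1, -1):
            for leaf_signs in product((+1, -1), repeat=n):
                total += count_reversals(tree, root_sign, leaf_signs)
                scenarios += 1
    return Fraction(total, scenarios)


print(average_reversals(3), average_reversals(4))
```

In each case the result equals the average number of internal vertices, minus one for the root, plus $n/2$, which is the decomposition used in the theorem.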



4.8 The average length of a reversal 

The length of a reversal in a scenario is equal to the size (number of leaves) of the corresponding subtree
of the common interval tree. Again, we first estimate this quantity by studying a parameter on the unsigned
trees of $\mathcal{S}$, and then adjust for the reversals of length one represented by signs on the leaves.

Our analysis is guided by the study of a related parameter called pathlength that frequently makes
a cameo appearance when trees are used to study sorting algorithms.

Let $\Psi(\tau)$ be the sum of the sizes of all subtrees of $\tau \in \mathcal{S}$ rooted at an internal vertex. Examining
Figure 2, we can formulate a recursive description of $\Psi(\tau)$. Consider $\tau \in \mathcal{S}$. Either $\tau$ is a single leaf,
in which case $\Psi(\tau) = 0$, or the root has $m$ children, labeled left to right by $\tau_1, \ldots, \tau_m$. To compute
$\Psi(\tau)$, we sum the sizes of the subtrees of each child of the root, and then add the size of the entire
tree, which of course is the sum of the sizes of the children. This is written

\[ \Psi(\tau) = \sum_{j=1}^{m} \big(\Psi(\tau_j) + |\tau_j|\big). \]
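The recursive description translates line by line into code. This minimal Python sketch assumes a nested-tuple encoding chosen for illustration (a leaf is the empty tuple, an internal vertex is the tuple of its children); the names `size` and `psi` are hypothetical.

```python
def size(tree):
    """Number of leaves of a tree."""
    return 1 if tree == () else sum(size(child) for child in tree)


def psi(tree):
    """Additive parameter: 0 on a leaf, else the sum over children of psi(child) + |child|."""
    if tree == ():
        return 0
    return sum(psi(child) + size(child) for child in tree)


leaf = ()
cherry = (leaf, leaf)
print(psi(cherry), psi((leaf, leaf, leaf)), psi((leaf, cherry)))  # 2 3 5
```

The three printed values correspond to the tree with two leaves and to two of the three trees with three leaves.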

A tree parameter that satisfies such a relationship is an additive parameter and writing the
corresponding functional equation is straightforward. We mark the parameter $\Psi$ by the variable $v$ in the
bivariate generating function $S(z, v)$:

\[ S(z, v) = \sum_{\tau \in \mathcal{S}} v^{\Psi(\tau)} z^{|\tau|} = z + v^2 z^2 + (v^3 + 2v^5) z^3 + \cdots \qquad (9) \]

This parameter is identical to the pathlength parameter, and the steps from the generating function to
the equation are all well-explained in [17, Section III.5]. We derive the functional equation

\[ S(z, v) = z + \frac{S(vz, v)^2}{1 - S(vz, v)}. \qquad (10) \]



Rather than solve for $S(z, v)$, it is easier to solve for $\frac{\partial}{\partial v} S(z, v)|_{v=1}$ directly by differentiating Eq. (10)
with respect to $z$ and $v$, and setting $v = 1$ in the resulting equations. This leads to two equations
in two unknowns. Using the notation $S_v(z) = \frac{\partial}{\partial v} S(z, v)|_{v=1}$, $S_z(z) = \frac{\partial}{\partial z} S(z, v)|_{v=1}$, and recalling
$S(z, 1) = S(z)$, the counting ordinary generating function for $\mathcal{S}$, this leads to the system

\[ S_v(z) = \frac{S(z)\,(2 - S(z))\,(S_z(z)\, z + S_v(z))}{(1 - S(z))^2}, \qquad S_z(z) = 1 + \frac{S(z)\, S_z(z)\,(2 - S(z))}{(1 - S(z))^2}. \]

We solve $S_v(z)$ in terms of $S(z)$:

\[ S_v(z) = \frac{z\, S(z)\,(2 - S(z))\,(1 - S(z))^2}{(1 - 4S(z) + 2S(z)^2)^2} = 2z^2 + 13z^3 + 80z^4 + \cdots \]

This is an explicit expression to which we apply singularity analysis to determine an expression for the
coefficient of $z^n$:

\[ [z^n]\, \frac{\partial}{\partial v} S(z, v)\Big|_{v=1} = [z^n]\, S_v(z) \sim \frac{1}{8\sqrt{2}}\,(3 - \sqrt{8})^{-n}. \]

This value approximates the sum of the sizes of all subtrees of all trees of size $n$ in $\mathcal{S}$. The average value
of $\Psi$ is the quotient

\[ \frac{\frac{1}{8\sqrt{2}}\,(3 - \sqrt{8})^{-n}}{\frac{1}{4}\sqrt{\frac{\sqrt{18} - 4}{\pi}}\,(3 - \sqrt{8})^{-n}\, n^{-3/2}} = \frac{\sqrt{2\pi}}{4\sqrt{\sqrt{18} - 4}}\; n^{3/2} \approx 1.27\, n^{3/2}. \]

To get the expected sum of the lengths of the reversals of a parsimonious perfect scenario for a
random commuting permutation, we consider adjustments that occur for each tree in $\mathcal{S}$, so we can add
them directly to this value. For each tree we remove the size of the whole tree ($n$) since we do not count
this as a reversal, and we also add the average contribution of the reversals of size 1 ($n/2$). These two
adjustments do not affect the asymptotic growth since $n^{3/2}$ dominates $n$ for large $n$.

To determine the average length, we now divide by the average number of reversals, which we
determined to be $\frac{1+\sqrt{2}}{2}\, n$ in Theorem 16. We summarize these results in the following theorem.

Theorem 17. The average length of a reversal in a parsimonious perfect scenario for a random signed
commuting permutation of size $n$ is asymptotically

\[ \frac{\sqrt{2\pi}}{2\,(1 + \sqrt{2})\,\sqrt{\sqrt{18} - 4}}\; \sqrt{n} \approx 1.054\, \sqrt{n}. \]
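The leading term can be compared with exact values for moderately large $n$. The Python sketch below is a hedged illustration: it rewrites the solved system as $(1 - 4S + 2S^2)\,S_v = S(2 - S)\, z S'(z)$, a rearrangement made here for computational convenience, and extracts coefficients with exact integer arithmetic. The function names are hypothetical, and the average length uses the asymptotic reversal count from Theorem 16 as its denominator, as in the text.

```python
import math


def convolve(a, b, n_max):
    """Coefficients 0..n_max of the product of two series (lists of length n_max + 1)."""
    return [sum(a[i] * b[n - i] for i in range(n + 1)) for n in range(n_max + 1)]


def psi_average(n_max):
    """Exact average of Psi over Schroeder trees with n_max leaves."""
    s = [0] * (n_max + 1)
    s[1] = 1
    for n in range(2, n_max + 1):
        s[n] = 2 * sum(s[i] * s[n - i] for i in range(1, n)) - s[n - 1]
    s2 = convolve(s, s, n_max)
    a = [4 * x - 2 * y for x, y in zip(s, s2)]        # 4S - 2S^2
    b = [2 * x - y for x, y in zip(s, s2)]            # S(2 - S)
    zs_prime = [n * s[n] for n in range(n_max + 1)]   # z S'(z)
    r = convolve(b, zs_prime, n_max)
    # (1 - 4S + 2S^2) P = S(2 - S) z S' with P = S_v; a[0] = 0, so P is found recursively.
    p = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        p[n] = r[n] + sum(a[i] * p[n - i] for i in range(1, n + 1))
    return p[n_max] / s[n_max]


n = 400
leading = math.sqrt(2 * math.pi) / (4 * math.sqrt(math.sqrt(18) - 4))  # about 1.27
avg_psi = psi_average(n)
# Adjustments from the text: drop the whole tree (n), add the expected n/2 leaf reversals.
avg_length = (avg_psi - n + n / 2) / ((1 + math.sqrt(2)) / 2 * n)
print(avg_psi / (leading * n ** 1.5), avg_length / math.sqrt(n))
```

Because the correction to the leading term is of relative order $n^{-1/2}$, convergence here is slower than for the counting estimate, and the printed ratios approach their limits (1 and about 1.054) only gradually.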



5 Conclusion 

Summary. Perfect sorting by reversals, although an intractable problem, is very likely to be solved
in polynomial time for random signed permutations, under the uniform distribution. This result relies 
on a study of the shape of a random strong interval tree that shows that asymptotically such trees are 
mostly composed of a large prime vertex at the root and small subtrees. We were also able to give 
precise asymptotic results for the expected lengths of a parsimonious perfect scenario and of a reversal 
of such a scenario for random commuting permutations. Our results were obtained using techniques of 
enumerative and analytic combinatorics. 



Discussion on our results. Recently, several works have investigated average properties of
combinatorial objects related to genomic distance computation, such as the breakpoint graph [29, 33, 35],
conserved segments [31] or adjacencies and common intervals [12, 34]. The motivation for such works
can be twofold. One can be interested in the expected behavior of some algorithms, such as in [53], which
shows that the most intricate part of the theory of sorting by reversals (clearing hurdles and fortress) is
not required on uniform random permutations. Our results on the average complexity of computing a
parsimonious perfect scenario belong to this family of results. In other cases, one can be interested in
the expected properties of an evolutionary scenario for random genomes [33, 35]. This allows one, given real
data, to assess the significance of the comparison of a pair of genomes and to compute statistical tests
measuring the evolutionary signal left: intuitively, if a scenario between two real genomes looks like a
scenario between random genomes, one can make the hypothesis that there is little to no evolutionary
signal left in the considered pair of real genomes. Our results on commuting permutations are of such
nature.

The fact that computing a parsimonious perfect scenario requires polynomial time on the average is 
mostly a theoretical result, that completes the complexity analysis of the problem. Indeed, real data sets 
(pairs of genomes or genome segments) are in general not expected to define strong interval trees with 
a large number of prime nodes (see [23] for example). So the algorithms described in were already 
known to be efficient on real data sets. 

It should however be noted that our results on the expected shape of a strong interval tree, and in 
particular on the number of prime vertices, generalize previous results on conserved adjacencies and
common intervals in permutations [3S[ [T^l [31]. They could form the basis for a deeper study of the 
expected shape of the strong interval tree, parametrized by the number of common adjacencies or of 
prime nodes. Also, the combinatorial specification of the class of strong interval trees opens the way 
to random generation algorithms [IB] of trees with some prescribed structure (such as the number or 
maximum degree of prime nodes), that we outline in the paragraph below on future research. This might
allow one to study, by simulation, the expected properties of a perfect scenario between pairs of genomes
defining strong interval trees with a prescribed structure. Such a way to assess the significance of features
of a hypothetical scenario between real genomes is clearly of practical interest.

In the same vein, the strong interval trees obtained in comparing pairs of mammalian genomes for 
example [23] contain very few prime nodes, and then contain large subtrees that represent commuting
permutations; these subtrees can then be compared to the expected properties of random commuting 
permutations to point at genome segments whose evolution is significantly non-random. 

Future research. There exist several more general models of genome rearrangements [TS]. Among
them, the most general is based on an operation called Double-Cut-and-Join (DCJ for short) that models
reversals and several other types of rearrangements. The notion of perfect DCJ scenarios has been studied
in [5] and has the intriguing property that instances that were hard to solve for reversals can be solved in 
polynomial time in the DCJ model and conversely. It would then be interesting to compare the average 
time complexity of perfect sorting by DCJ to the results we describe in the present work. 

We could, modulo the labeling of the prime nodes by simple permutations, easily describe T n using 
a grammar in the combinatorial calculus described in [T7]. This would give access to enumerative and 
structural information, when paired with the generating function for simple permutations. Generating 
these trees directly, i.e. without first generating the corresponding permutation, remains an interesting 
open problem, that seems to be well suited to Boltzmann random generation techniques [14].

Also, PQ-trees are a natural family of trees that are both related to common intervals of permutations
|T and used in comparative genomics [23]. Investigating average properties of PQ-trees is a natural
extension of the work presented here. 

More generally, the average properties of the many families of combinatorial objects that appear in
comparative genomics models and algorithms form an almost completely open field that contains many
challenging problems and deserves to be investigated.